
Support eventdb to record reported alarms#5

Merged
jjin62 merged 3 commits into 202411_otn from 202411_otn_alarm_eventdb_upstream
Mar 20, 2026

@dudu579 dudu579 commented Mar 17, 2026

Why I did it

Work item tracking

How I did it

  • Each ONIE_PLATFORM (device) owns its own alarm list: sonic-buildimage/device/molex/x86_64-otn-kvm_x86_64-r0/default.json
  • The eventd container installs the sonic-eventd-otn-profile Debian package to apply the device-level alarm list.
  • sonic-eventd-otn-profile copies the ONIE_PLATFORM alarm list and the syslog plugin into eventd.
Architecture
orchagent (docker-swss)
  │  SWSS_LOG_NOTICE → syslog tag: swss#orchagent
  ▼
Host rsyslogd
  │  matches programname "orchagent" (partial match via otn_events.conf)
  │  action: omprog → rsyslog_plugin -r /etc/rsyslog.d/otn_regex.json -m sonic-events-otn
  ▼
rsyslog_plugin  (host process)
  │  regex extracts: severity, resource, type-id, action
  │  publishes JSON to ZMQ tcp://127.0.0.1:5570
  ▼
eventd (docker-eventd) — ZMQ proxy XSUB:5570 → XPUB:5571
  ▼
eventdb / eventconsume.cpp (docker-eventd) — subscribes 5571
  │  looks up type-id in /etc/evprofile/default.json  (inside container)
  │  writes EVENT table
  │  action=RAISE/CLEAR → writes ALARM table
  ▼
Redis EVENT_DB (DB 19)
  ├── EVENT|<id>   { type-id, text, time-created, action, resource, severity }
  ├── EVENT_STATS
  ├── ALARM|<id>   { type-id, text, time-created, acknowledged, resource, severity }
  └── ALARM_STATS
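
The last stage of the pipeline above (look up the type-id, write the EVENT table, and update the ALARM table on RAISE/CLEAR) can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in eventconsume.cpp: `make_store`, `consume`, the dict-based store, and the rule that a CLEAR removes alarms with the same type-id and resource are all assumptions made for the example. The profile is used only to check that the type-id is known.

```python
import itertools
import time


def make_store():
    """In-memory stand-in for Redis EVENT_DB (DB 19); hypothetical, for illustration."""
    return {"EVENT": {}, "ALARM": {}, "next_id": itertools.count(1)}


def consume(store, profile, msg):
    """Record one parsed message. Returns the new event id, or None if skipped."""
    if msg["type-id"] not in profile:
        return None  # assumption: unknown type-ids are ignored

    event_id = next(store["next_id"])
    record = dict(msg)
    record["id"] = str(event_id)
    record["time-created"] = str(time.time_ns())
    store["EVENT"][event_id] = record

    action = msg.get("action")
    if action == "RAISE":
        # An alarm is an event that carries an action.
        alarm = dict(record)
        alarm["acknowledged"] = "false"
        store["ALARM"][event_id] = alarm
    elif action == "CLEAR":
        # Assumption: CLEAR removes active alarms with the same type-id and resource.
        key = (msg["type-id"], msg["resource"])
        for alarm_id, alarm in list(store["ALARM"].items()):
            if (alarm["type-id"], alarm["resource"]) == key:
                del store["ALARM"][alarm_id]
    return event_id
```

A message without an action only lands in the EVENT table, which matches the event/alarm distinction discussed later in this thread.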

Bug fix and changes

  • eventdb does not start inside eventd: priority.
  • Add OTN-OA min and max gain range in device config.
  • Possible failure when compiling in parallel (events test timeout).

@dudu579 dudu579 requested review from jjin62, oplklum and pkable March 17, 2026 17:38
@dudu579 dudu579 self-assigned this Mar 17, 2026

dudu579 commented Mar 17, 2026

Improvements:

  • Try to use a general type-id with description (description -> type-id from default.json in sonic-swss (orchagent))
  • Record description in eventdb with text
  • lower-case
  • Desired alarm-syslog format: timestamp, alarm-severity, nodename:entity, type-id, description, state
  • Alarm does not always have actions. Think about how to decide it
  • Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...)


dudu579 commented Mar 18, 2026

@jjin62 For the description -> type-id mapping, there is a concern. If we do this, the reported severity is dropped automatically, because every alarm that shares a type-id uses that type-id's severity. Here is my experiment:

root@sonic:~# logger -t "swss#orchagent" "2026-03-17T22:46:16Z, CRITICAL, sonic:OA0-0, Out of Gain Range, RAISE"

root@sonic:~# redis-cli -n 19 hgetall "ALARM|10"
 1) "time-created"
 2) "1773797133627957758"
 3) "action"
 4) "RAISE"
 5) "resource"
 6) "sonic:OA0-0"
 7) "type-id"
 8) "Out of Gain Range"
 9) "severity"
10) "MINOR"
11) "id"
12) "10"
13) "acknowledged"
14) "false" 

I tried to raise a CRITICAL alarm, but it ended up with the severity defined for Out of Gain Range in default.json. I think we need to keep type-id = alarm-id, with each alarm using its own severity (same as the SAI definition).
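
The coupling seen in the experiment can be illustrated with a small sketch. The function name `resolve_severity`, the `prefer_profile` switch, and the profile shape are hypothetical; the sketch only contrasts "profile severity wins" (what the experiment observed) with "reported severity wins" (the per-alarm alternative).

```python
def resolve_severity(profile, type_id, reported, prefer_profile=True):
    """Pick the severity to store for an alarm.

    prefer_profile=True mirrors the experiment above, where the value from
    default.json replaced the reported one. prefer_profile=False is the
    hypothetical alternative where each alarm keeps its own severity.
    """
    configured = profile.get(type_id, {}).get("severity")
    if prefer_profile and configured is not None:
        return configured
    return reported


# Assumed profile shape, reduced to the one field that matters here.
profile = {"Out of Gain Range": {"severity": "MINOR"}}

print(resolve_severity(profile, "Out of Gain Range", "CRITICAL"))         # MINOR
print(resolve_severity(profile, "Out of Gain Range", "CRITICAL", False))  # CRITICAL
```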


dudu579 commented Mar 18, 2026

| Improvements | Conflicts |
|---|---|
| General type-id with description (description -> type-id: default.json) | Severity tight coupling, see above |
| Record description in eventdb with text | No conflicts |
| lower-case | Do we need to follow the naming rules in sonic-yang? They define action/severity as enumerations (all upper case). If not, we can change it. |
| Desired alarm-syslog format | No conflicts |
| Alarm does not always have actions. Think about how to decide it | Based on the consumer code, alarms and events are distinguished by action: an alarm has an action, an event does not. We could follow that rule. Alarms are a subset of events. |
| Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...) | Molex should not be responsible for fixing this, or it may not be a bug at all. SAI or the driver should never raise the same alarms. |

@dudu579 dudu579 force-pushed the 202411_otn_alarm_eventdb_upstream branch 2 times, most recently from 7c6e83b to 6e94484 Compare March 18, 2026 22:32

dudu579 commented Mar 19, 2026

Tried using sonic-event.yang to generate the command lines; they are working now. https://github.com/sonic-molex/sonic-swss only supports events, not alarms. (Out of Gain Range is not a good demo case: eventdb treats an event whose syslog carries an action as an alarm.)

| Type | syslog format | Example |
|---|---|---|
| Event | timestamp, alarm-severity, nodename:entity, type-id, description | logger -t "swss#orchagent" "2026-03-05T22:47:16Z, MINOR, sonic:OA0-0, Out of Gain Range, test" |
| Alarm | timestamp, alarm-severity, nodename:entity, type-id, description, action | logger -t "swss#orchagent" "2026-03-17T22:46:16Z, CRITICAL, sonic:OA0-0, Out of Gain Range, Out of Gain Range, RAISE" |
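
To make the two formats concrete, here is a rough parsing sketch. It is not the rsyslog_plugin regex from otn_regex.json; `parse_otn_syslog` and the `ACTIONS` set are hypothetical names, and the sketch assumes that only RAISE and CLEAR are valid actions. The description is allowed to contain commas, as in "out of range [300, 2000]".

```python
ACTIONS = {"RAISE", "CLEAR"}  # assumption: the only actions that mark an alarm


def parse_otn_syslog(message):
    """Split an OTN event/alarm syslog message into named fields."""
    # The first four fields never contain commas; the rest is the description,
    # optionally followed by a trailing action.
    parts = [p.strip() for p in message.split(",", 4)]
    if len(parts) != 5:
        raise ValueError("expected at least 5 comma-separated fields")
    timestamp, severity, resource, type_id, rest = parts

    record = {
        "timestamp": timestamp,
        "severity": severity,
        "resource": resource,
        "type-id": type_id,
    }
    text, _, tail = rest.rpartition(",")
    if tail.strip() in ACTIONS:
        record["text"] = text.strip()
        record["action"] = tail.strip()  # has an action: treated as an alarm
    else:
        record["text"] = rest  # no action: a plain event
    return record
```

Calling it on the Alarm example from the table yields a record with `action` set to `RAISE`, while the Event example yields a record with no `action` key.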

admin@sonic:~$ show event
ID    EVENT STATE
----  -------------
admin@sonic:~$ show event-stats
ID    EVENTS    RAISED    ACKED    CLEARED
----  --------  --------  -------  ---------
admin@sonic:~$ config otn-oa update OA0-0 --target-gain 40
Root privileges are required for this operation
admin@sonic:~$ sudo -i
root@sonic:~# config otn-oa update OA0-0 --target-gain 40
sonic_yang(6):Note: Below table(s) have no YANG models: OTN_OCM, OTN_OCM_CHANNEL, OTN_WSS, OTN_WSS_SPEC_POWER
sonic_yang(6):Note: Below table(s) have no YANG models: OTN_OCM, OTN_OCM_CHANNEL, OTN_WSS, OTN_WSS_SPEC_POWER
root@sonic:~# cat /usr/bin/yang_auto_cli.sh ^C
root@sonic:~# ^C
root@sonic:~# show event
  ID  EVENT STATE
----  -----------------------------------------------------------------
   1  resource:      sonic:OA0-0
      text:          Target gain value 4000 is out of range [300, 2000]
      time-created:  1773879207694653367
      type-id:       Out of Gain Range
      severity:      MINOR
root@sonic:~# show event-stats
ID       EVENTS    RAISED    ACKED    CLEARED
-----  --------  --------  -------  ---------
state         1         0        0          0

Support sonic-event.yang CLI generation in yang_auto_cli.sh
@dudu579 dudu579 force-pushed the 202411_otn_alarm_eventdb_upstream branch from 6e94484 to ad1948e Compare March 19, 2026 00:31

dudu579 commented Mar 19, 2026

Open questions:

  • In default.json, a comment says: "use 'event profile ' command to apply that profile without having to send SIGINT to eventd." But right now I cannot run the event command in SONiC at all. I believe it lives in some PRs, e.g. Event Management support (sonic-net/sonic-mgmt-framework#85). That way we could avoid installing the Debian package.
  • Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...)
  • sonic-alarm.yang is not extended with sonic-yang. If we want to enable show alarm in SONiC, we need to change sonic-alarm.yang.

@jjin62 jjin62 merged commit e131fdc into 202411_otn Mar 20, 2026
1 check passed
@jjin62 jjin62 deleted the 202411_otn_alarm_eventdb_upstream branch March 20, 2026 18:08
oplklum pushed a commit that referenced this pull request Mar 31, 2026
* Support eventdb to record reported alarms

* Add description in alarmDB record
Support sonic-event.yang CLI generation in yang_auto_cli.sh
oplklum pushed a commit that referenced this pull request Apr 1, 2026
* Support eventdb to record reported alarms

* Add description in alarmDB record
Support sonic-event.yang CLI generation in yang_auto_cli.sh
jjin62 pushed a commit that referenced this pull request Apr 1, 2026
…net#25643)

* [build] Add build timing report and dependency analysis tools

Add three scripts for build performance instrumentation:

- scripts/build-timing-report.sh: Parse per-package timing from build
  logs (HEADER/FOOTER timestamps), generate sorted duration table,
  phase breakdown, parallelism timeline, and CSV export.

- scripts/build-dep-graph.py: Parse rules/*.mk dependency graph,
  compute critical path, fan-out/fan-in bottleneck analysis, and
  generate DOT/JSON output for visualization.

- scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O,
  and Docker container count during builds for resource utilization
  analysis.

Add "make build-report" target to slave.mk that runs the timing
report and dependency analysis after a build completes.

Example output from a VS build on 24-core/30GB machine:
- 210 packages built in 53m wall time (173m CPU)
- Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4)
- Critical path: 14 packages deep (libnl -> libswsscommon -> utilities)
- Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

* Address Copilot review: fix 17 bugs in build analysis scripts

- Use free -m with division instead of free -g to avoid rounding (#1)
- Add = and ?= to Makefile dependency regex patterns (#2, #7)
- CPU calculation now uses /proc/stat delta (two reads) (#3, #14)
- Fix misleading 'critical path estimate' comment (#4)
- Fix parallelism timeline comment (60s not 10s) (#5)
- Include after-relationship packages in fan stats (#6)
- Guard disk I/O division by zero when INTERVAL<=1 (#8)
- Remove unused elapsed_line variable (#9)
- Remove redundant LIBSWSSCOMMON_DBG check (#10)
- Remove active_make_jobs from CSV header comment (#11)
- Wire up _RDEPENDS parsing to build reverse deps (#12)
- Remove unnecessary 'if v' filter on rdeps JSON (#13)
- Remove unused REPORT_FORMAT parameter (#15)
- Add cycle detection to critical path algorithm (#16)
- Add execute permission check for companion scripts (#17)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

---------

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
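
The commit message above mentions a critical path computation with cycle detection over the package dependency graph. The script itself is not shown here, so the following is only a sketch of the general technique (memoized depth-first search with an "in progress" set), measuring the path by package count rather than build time. The function name and input shape are assumptions.

```python
def critical_path(deps):
    """Return the longest dependency chain, deepest dependency first.

    deps maps a package name to the list of packages it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    memo = {}          # package -> longest chain ending at that package
    visiting = set()   # packages on the current DFS stack

    def longest(pkg):
        if pkg in memo:
            return memo[pkg]
        if pkg in visiting:
            raise ValueError(f"dependency cycle through {pkg}")
        visiting.add(pkg)
        best = []
        for dep in deps.get(pkg, []):
            chain = longest(dep)
            if len(chain) > len(best):
                best = chain
        visiting.discard(pkg)
        memo[pkg] = best + [pkg]
        return memo[pkg]

    result = []
    for pkg in deps:
        chain = longest(pkg)
        if len(chain) > len(result):
            result = chain
    return result


# Toy graph loosely following the chain named in the commit message.
example = {"utilities": ["libswsscommon"], "libswsscommon": ["libnl"], "libnl": []}
print(critical_path(example))  # ['libnl', 'libswsscommon', 'utilities']
```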
jjin62 pushed a commit that referenced this pull request Apr 1, 2026
…dating udevd rules (sonic-net#26343)

- Why I did it
On SONiC SmartSwitch platforms with DPUs, systemd-udevd crashes with SIGABRT on every reboot when DPU firmware initialization is slow. During the initramfs boot phase, a standalone systemd-udevd daemon is started to handle device discovery. If DPU firmware takes longer than the 60-second udevadm settle timeout (BlueField-3 DPUs can take 120 seconds each in the failure case when they are stuck), the initramfs cannot stop this udevd before switch_root. The stale process survives into the real system but is never chrooted into the overlayfs root, leaving it with a broken filesystem view. When dpu-udev-manager.sh writes udev rules, the stale udevd detects the change and crashes on an assertion in systemd's chase() path resolution (assert(path_is_absolute(p)) at chase.c:648), because dir_fd_is_root() returns false for a process whose root still points to the initramfs rootfs rather than the overlayfs.

This triggers a systemd issue, systemd/systemd#29559, which the maintainers do not consider a bug on the systemd side. Raising this fix for our use case.

Core was generated by `/usr/lib/systemd/systemd-udevd --daemon --resolve-names=never'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f29fe7a1cc2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f29fe78a4ac in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f29fea50c11 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#4  0x00007f29feb94a8b in chase () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#5  0x00007f29feb956e2 in chase_and_opendir () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#6  0x00007f29feb9a609 in conf_files_list_strv () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#7  0x00007f29fea913e8 in config_get_stats_by_path () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#8  0x0000559f295519cf in ?? ()
#9  0x0000559f29553a77 in ?? ()
#10 0x00007f29fec36055 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#11 0x00007f29fec3668d in sd_event_dispatch () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#12 0x00007f29fec394a8 in sd_event_run () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#13 0x00007f29fec396c7 in sd_event_loop () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#14 0x0000559f29545820 in ?? ()
#15 0x00007f29fe78bca8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007f29fe78bd65 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000559f29545c51 in ?? ()

- How I did it
Added a kill_stale_udevd() function to dpu-udev-manager.sh that runs before writing the udev rules. It identifies the systemd-managed udevd PID via systemctl show, then kills any other systemd-udevd --daemon process that doesn't match -- these are leftover initramfs instances. If no stale process exists (e.g. DPUs are healthy and the initramfs udevd exited cleanly), the function is a no-op.
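
The selection logic described above can be sketched as a pure function. The real kill_stale_udevd() lives in a shell script that is not shown here, so this Python version is only an illustration: `find_stale_udevd` and its input shape (a list of PID and command-line pairs, plus the systemd-managed PID supplied by the caller) are assumptions.

```python
def find_stale_udevd(processes, managed_pid):
    """Return PIDs of udevd daemons that are not the systemd-managed one.

    processes: iterable of (pid, cmdline) pairs, e.g. gathered from /proc.
    managed_pid: main PID of systemd-udevd.service as reported by systemd.
    """
    stale = []
    for pid, cmdline in processes:
        args = cmdline.split()
        if not args:
            continue
        is_udevd = args[0].endswith("systemd-udevd")
        if is_udevd and "--daemon" in args and pid != managed_pid:
            stale.append(pid)
    return stale


procs = [
    (212, "/usr/lib/systemd/systemd-udevd --daemon --resolve-names=never"),
    (845, "/usr/lib/systemd/systemd-udevd"),
    (900, "/usr/bin/python3 some-other-service"),
]
print(find_stale_udevd(procs, managed_pid=845))  # [212]
```

When no leftover daemon exists the result is empty, which corresponds to the no-op case described above.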

- How to verify it
Deploy the image on a SmartSwitch with DPUs in a state where firmware initialization times out (>60s per DPU) by stopping image installation before firmware install step
Reboot the switch
Verify no new systemd-udevd coredumps in /var/core/
Verify the stale process was killed: journalctl -b 0 | grep dpu-udev-manager should show killing stale initramfs udevd PID (systemd udevd is PID )
Verify systemd-udevd.service is healthy: systemctl status systemd-udevd should show active (running)
Verify DPU udev rules were written: cat /etc/udev/rules.d/92-midplane-intf.rules should contain the DPU interface naming rules

Signed-off-by: Hemanth Kumar Tirupati <tirupatihemanthkumar@gmail.com>